Vamsi Pasala - Personal Loan Campaign

Description

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

Data Description

Data Dictionary

1. Loading libraries

Import required libraries for the project.
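A typical import cell for this kind of project might look like the following; the exact set is an assumption and should be adjusted to the libraries actually used downstream.

```python
# Typical imports for this notebook (assumed set; adjust as needed).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
```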

Load data

Copy loaded data to new variable

View the first and last 5 rows of the dataset.

View the first 10 rows of the dataset, sorted by Age.

Summary of the dataset.

Data Imputation

Converting the features 'Income', 'CCAvg', and 'Mortgage' from thousands (a value of 1 = $1,000) to single-dollar values
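A minimal sketch of this conversion, assuming the three columns are recorded in thousands of dollars (the column names match the data dictionary; the sample values below are made up):

```python
import pandas as pd

# Hypothetical sample rows; in the dataset these columns are recorded in
# thousands of dollars, so multiplying by 1000 yields dollar values.
df = pd.DataFrame({"Income": [49, 34], "CCAvg": [1.6, 1.5], "Mortgage": [0, 155]})
df[["Income", "CCAvg", "Mortgage"]] = df[["Income", "CCAvg", "Mortgage"]] * 1000
```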

Univariate Analysis

Bivariate analysis

Summary of EDA

Data Description:

Data Cleaning:

Observations from EDA:

Actions for data pre-processing:

3. Data Pre-Processing

Dropping ID and the highly correlated feature Age

Creating training and test sets.
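A sketch of a stratified split, assuming a 70/30 ratio (the notebook's actual ratio and random seed may differ), with a synthetic dataset standing in for the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data; the 70/30 split is an assumption.
# Stratifying on y preserves the class balance in both partitions, which
# matters here because loan purchasers are a small minority.
X, y = make_classification(n_samples=100, weights=[0.9], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7)
```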

Building the model

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting that the Personal_Loan should be approved for a customer when, in reality, it is not approved.
  2. Predicting that the Personal_Loan shouldn't be approved for a customer when, in reality, it is approved.

Which case is more important?

How can we reduce this loss, i.e., reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
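One possible shape for such helpers (the function names here are illustrative, not the notebook's own):

```python
from sklearn import metrics

def model_scores(y_true, y_pred):
    """Return the classification metrics used throughout as a dict."""
    return {
        "Accuracy": metrics.accuracy_score(y_true, y_pred),
        "Recall": metrics.recall_score(y_true, y_pred),
        "Precision": metrics.precision_score(y_true, y_pred),
        "F1": metrics.f1_score(y_true, y_pred),
    }

def confusion(y_true, y_pred):
    """Return the confusion matrix flattened as (tn, fp, fn, tp)."""
    return metrics.confusion_matrix(y_true, y_pred).ravel()
```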

Logistic Regression

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds

Odds from coefficients
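Since logistic regression coefficients are log-odds, exponentiating each coefficient gives the multiplicative change in odds for a one-unit increase in that feature. A sketch with made-up coefficient values:

```python
import numpy as np

# Logistic regression coefficients are in log-odds; np.exp converts each
# to an odds ratio: the multiplicative change in odds per unit increase.
# The coefficient values below are made up for illustration.
coefs = np.array([0.05, -0.7, 1.2])
odds = np.exp(coefs)
# A coefficient of 0 maps to odds of 1 (no effect); negative coefficients
# map below 1 (odds decrease), positive ones above 1 (odds increase).
```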

Coefficient interpretations

Interpretation for other attributes can be done similarly.

Checking model performance on training set

Checking performance on test set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve
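One common way to pick a threshold from the ROC curve is to maximise Youden's J statistic (TPR - FPR); a minimal sketch on toy data (the notebook may use a different criterion):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities, made up for illustration.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J = TPR - FPR; the threshold maximising it balances both rates.
best_threshold = thresholds[np.argmax(tpr - fpr)]
```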

Checking model performance on training set

Checking model performance on test set

Let's use the Precision-Recall curve and see if we can find a better threshold
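One heuristic for picking a threshold from the Precision-Recall curve is the point where precision and recall are closest to each other; a sketch on toy data (the notebook's criterion may differ):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities, made up for illustration.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point before matching them against thresholds.
best_threshold = thresholds[np.argmin(np.abs(precision[:-1] - recall[:-1]))]
```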

Checking model performance on training set

Checking model performance on test set

Model Performance Summary

Build Decision Tree Model

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
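A GridSearchCV sketch on synthetic data; the parameter grid and scoring metric here are assumptions, not the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X, y = make_classification(n_samples=200, random_state=1)

# Illustrative grid; the ranges tuned in the notebook may differ.
# Scoring on recall matches the goal of minimising False Negatives.
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                      scoring="recall", cv=5)
search.fit(X, y)
best_tree = search.best_estimator_
```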

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
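A minimal sketch of cost_complexity_pruning_path, using synthetic data in place of the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X, y = make_classification(n_samples=200, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
# ccp_alphas are the effective alphas at each pruning step; impurities are
# the corresponding total leaf impurities. As alpha grows, more of the
# tree is pruned and total leaf impurity is non-decreasing.
ccp_alphas, impurities = path.ccp_alphas, path.impurities
```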

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decrease as alpha increases.
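The procedure above can be sketched as follows (synthetic data stands in for the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X, y = make_classification(n_samples=200, random_state=0)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Fit one tree per effective alpha; the largest alpha prunes the whole
# tree, leaving a single root node.
clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
        for a in path.ccp_alphas]

# Drop the trivial single-node tree and its alpha.
clfs, ccp_alphas = clfs[:-1], path.ccp_alphas[:-1]
node_counts = [c.tree_.node_count for c in clfs]
depths = [c.tree_.max_depth for c in clfs]
```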

Recall is highest at an alpha of 0.06, but if we choose it the decision tree will only have a root node and we would lose the business rules; instead, we can choose an alpha of 0.015, retaining information while still getting a high recall.

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Creating model with 0.002 ccp_alpha

Checking performance on the training set

Checking performance on the test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusion

Recommendations